Executive Summary

SwiftKey develops a word prediction application used while typing on a mobile keyboard. When the user types “I went to the”, the application presents three options for what the next word might be; for example, gym, store, or restaurant.

In this project, we use R to build a predictive model using text data provided by the Data Science Capstone course. The data consists of text from ‘Blogs’, ‘News’ and ‘Twitter’ totaling more than 4 million lines and ??? unique words.

TL;DR

In a nutshell, here’s a summary of the data analysis performed in this report.

  1. Load the raw data.
  2. Extract a 1% subsample of the data.
  3. Preprocess the data to remove stopwords, convert to lowercase, and perform other cleanup steps.
  4. Generate 2-grams, 3-grams and 4-grams.
  5. Present a technique to use the Dirichlet-multinomial model as a language model.

Loading the data

First, I fetched the data from the URL provided by the course. Here are line counts per file.

$ wc -l data/final/en_US/*.txt
  899288 data/final/en_US/en_US.blogs.txt
 1010242 data/final/en_US/en_US.news.txt
 2360148 data/final/en_US/en_US.twitter.txt
 4269678 total
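For reference, the raw files can be read into R along these lines (a sketch; it assumes the course zip has already been unzipped into data/final/en_US):

```r
# Read each raw file; skipNul guards against embedded NUL bytes
# present in some of the files.
files <- c(blogs   = "data/final/en_US/en_US.blogs.txt",
           news    = "data/final/en_US/en_US.news.txt",
           twitter = "data/final/en_US/en_US.twitter.txt")
raw <- lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE)
sapply(raw, length)  # line counts per source, matching wc -l above
```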

Explore the Data

First, we sample 1% of the lines in each file to speed up data exploration. The implementation is in sample_capstone_data in sample_data.R. We use the tm R package to load each sample file for analysis.
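The subsampling itself amounts to an independent coin flip per line, along these lines (a sketch; sample_capstone_data in sample_data.R is the actual implementation, and the name sample_lines here is illustrative):

```r
set.seed(20240101)  # fixed seed so the subsample is reproducible

# Keep each line independently with probability `fraction`.
sample_lines <- function(lines, fraction = 0.01) {
  keep <- rbinom(length(lines), size = 1, prob = fraction) == 1
  lines[keep]
}
```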

sample_vector_corpus <- get_sample_datums_vector_corpus()
content_stats_df <- do_explore_per_data_source(sample_vector_corpus)
content_stats_df
##         source num_lines num_unique_words mean_word_freq median_word_freq
## 1      twitter     23602             8040             20                9
## 2        blogs      8993            12414             15                6
## 3         news     10103            12850             15                7
## 4 all combined         3            22899             24                7
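The per-source statistics above can be reproduced with base R along these lines (a sketch; word_stats is an illustrative name, not project code):

```r
# Tokenize on non-letter characters and summarize word frequencies.
word_stats <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  freq <- table(tokens[nzchar(tokens)])
  c(num_lines        = length(text),
    num_unique_words = length(freq),
    mean_word_freq   = round(mean(freq)),
    median_word_freq = round(median(freq)))
}
```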

Preprocessing

We perform the following text processing steps prior to parsing ngrams.

  • Remove all punctuation
  • Remove all numbers
  • Convert all words to lowercase
  • Remove English stopwords
  • Strip extra whitespace
  • Remove profanity
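With the tm package these steps chain naturally via tm_map (a sketch; the profanity list is assumed to be supplied by the caller, and lowercasing is applied before stopword removal since stopwords("english") is lowercase):

```r
library(tm)

preprocess_corpus <- function(corpus, profanity = character(0)) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  if (length(profanity) > 0)
    corpus <- tm_map(corpus, removeWords, profanity)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
```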

Word Frequency Distribution

For example, look at the word frequency distribution for the Twitter sample data.

p <- twitter_word_plot(sample_vector_corpus)
print(p)

NGram Analysis

Here are the top bigrams.

p <- ngrams_per_source_plot(sample_vector_corpus, num_gram=2)

Here are the top trigrams.

p <- ngrams_per_source_plot(sample_vector_corpus, num_gram=3)

Here are the top 4-grams.

p <- ngrams_per_source_plot(sample_vector_corpus, num_gram=4)
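For intuition, n-gram extraction can be sketched with base R alone (ngrams_per_source_plot wraps the project's actual tokenizer and plotting; extract_ngrams here is illustrative):

```r
# Return all n-grams (as space-joined strings) from a character vector.
extract_ngrams <- function(text, n = 2) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  # embed() gives a sliding window of length n over the token vector;
  # columns come out newest-first, so reverse them into reading order.
  windows <- embed(tokens, n)[, n:1, drop = FALSE]
  apply(windows, 1, paste, collapse = " ")
}

# "i went", "went to" and "to the" each occur twice here:
table(extract_ngrams(c("i went to the gym", "i went to the store"), 2))
```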

NGram Language Model

We build a tree from the n-grams and compute maximum likelihood estimates (MLE) using the Dirichlet-multinomial model. We use the data.tree package, which can build a tree from a data.frame. Now let's perform a search for “data”.
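Under the Dirichlet-multinomial model, the predictive probability of a next word reduces to a smoothed relative frequency; with prior count alpha = 0 it is the plain MLE reported below. A minimal sketch (alpha and the function name are assumptions, not project code):

```r
# Posterior predictive probability that a word follows `context`:
# (count(context, word) + alpha) / (count(context) + alpha * vocab_size)
next_word_prob <- function(count_ngram, count_context, vocab_size, alpha = 0) {
  (count_ngram + alpha) / (count_context + alpha * vocab_size)
}

next_word_prob(12, 198, vocab_size = 22899)  # "entry" after "data": ~0.0606
```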

Next Word Prediction for ‘data’

docs <- load_sample_dircorpus()
docs <- preprocess_entries(docs)
ngram_tree <- ngram_language_modeling(docs)
plot_tree_for_report(ngram_tree)

Here are the maximum likelihood estimates. They show a 6% likelihood that “entry” will be the next word: “data entry” has a frequency of 12 and “data” has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.

results <- perform_search(ngram_tree, c("data"))
print(results)
##                   12                   10                  
## recommended_words "entry"              "streams"           
## likelihood        "0.0606060606060606" "0.0505050505050505"
##                   8                    7                   
## recommended_words "recovery"           "dating"            
## likelihood        "0.0404040404040404" "0.0353535353535354"
##                   7                   
## recommended_words "personalize"       
## likelihood        "0.0353535353535354"

Next Word Prediction for ‘data entry’

Then, if we query for “data entry”, we traverse the tree through the nodes “data” and “entry”, and we recommend the words “just” and “respond”.
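Conceptually, the lookup walks a trie keyed on words. A base-R sketch with nested lists (the real tree and perform_search are project code, so the names and structure here are illustrative):

```r
# Subtree rooted at the node for "data": children keyed by the next word.
data_node <- list(count = 198, children = list(
  entry = list(count = 12, children = list(
    just    = list(count = 6, children = list()),
    respond = list(count = 6, children = list())
  ))
))

# Walk the trie along `context`, then rank children by relative frequency.
predict_next <- function(node, context) {
  for (w in context) {
    node <- node$children[[w]]
    if (is.null(node)) return(numeric(0))  # unseen context
  }
  counts <- vapply(node$children, function(ch) ch$count, numeric(1))
  sort(counts / node$count, decreasing = TRUE)
}

predict_next(data_node, "entry")  # just = 0.5, respond = 0.5
```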

plot_tree_for_report(ngram_tree, highlight_child = TRUE)
results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
##                   6      6        
## recommended_words "just" "respond"
## likelihood        "0.5"  "0.5"

Next Steps

  • Build a model using more than a 1% sample of the data.
  • Deploy the n-gram tree to the server side of a Shiny application.